Avoid streaming incomplete UTF-8 characters #727

Merged
merged 1 commit into from
Mar 24, 2025

Conversation

corebonts
Contributor

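Some characters, such as the Chinese fù, are sometimes returned split across two tokens: "\u00e8\u00b5" and "\u008b" in this case, i.e. the first two bytes (0xE8 0xB5) of the three-byte UTF-8 sequence arrive in one token and the last byte (0x8B) in the next.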

This also depends on the model, but when it happens (for example with DeepSeek R1), we have to wait until the character is complete and only then send it.
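To illustrate the "wait until the character is complete" behaviour, here is a minimal sketch of a buffer that holds back an unfinished trailing character. `incomplete_tail_len` and `Utf8Stream` are hypothetical names made up for this example and are not the identifiers used in the PR; the sketch only assumes that tokens arrive as raw byte strings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of trailing bytes that belong to a UTF-8
// character whose remaining bytes have not arrived yet (0 if none).
inline std::size_t incomplete_tail_len(const std::string &s)
{
    for (std::size_t i = 1; i < 5 && i <= s.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(s[s.size() - i]);
        if ((c & 0xC0) == 0x80)
        {
            continue; // continuation byte: keep scanning backwards
        }
        std::size_t need = 1; // ASCII or invalid byte: nothing to wait for
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        return i < need ? i : 0;
    }
    return 0;
}

// Hypothetical stream wrapper: accumulates token bytes and releases
// only the part that ends on a complete character.
class Utf8Stream
{
public:
    // Append a token and return the text that is safe to send now.
    std::string push(const std::string &token)
    {
        pending_ += token;
        std::size_t keep = incomplete_tail_len(pending_);
        std::string out = pending_.substr(0, pending_.size() - keep);
        pending_.erase(0, out.size());
        return out;
    }

private:
    std::string pending_; // bytes held back until the character completes
};
```

With the tokens from the description, pushing "\xE8\xB5" returns an empty string and the following "\x8B" returns all three bytes at once. A real server would also need to flush whatever is still pending when generation stops.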

This resolves #722 and #646

@corebonts
Contributor Author

corebonts commented Mar 21, 2025

The UTF-8 check is taken from here:

// check if there is incomplete UTF-8 character at the end
bool incomplete = false;
for (unsigned i = 1; i < 5 && i <= slot.generated_text.size(); ++i)
{
    unsigned char c = slot.generated_text[slot.generated_text.size() - i];
    if ((c & 0xC0) == 0x80)
    {
        // continuation byte: 10xxxxxx
        continue;
    }
    if ((c & 0xE0) == 0xC0)
    {
        // 2-byte character: 110xxxxx ...
        incomplete = i < 2;
    }
    else if ((c & 0xF0) == 0xE0)
    {
        // 3-byte character: 1110xxxx ...
        incomplete = i < 3;
    }
    else if ((c & 0xF8) == 0xF0)
    {
        // 4-byte character: 11110xxx ...
        incomplete = i < 4;
    }
    // else 1-byte character or invalid byte
    break;
}
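For readers less familiar with the bit masks, the lead-byte tests in the loop can be pulled out into a small classifier. This is only an illustrative sketch: `utf8_sequence_length` is a hypothetical helper, not something that exists in the codebase, and it does not check for overlong or otherwise invalid sequences.

```cpp
// Hypothetical helper: classify a single byte by its high bits.
// Returns the total length (1-4) of the character the byte starts,
// or 0 for a continuation byte or an invalid lead byte.
inline int utf8_sequence_length(unsigned char c)
{
    if ((c & 0x80) == 0x00)
        return 1; // 0xxxxxxx: 1-byte (ASCII) character
    if ((c & 0xE0) == 0xC0)
        return 2; // 110xxxxx: lead byte of a 2-byte character
    if ((c & 0xF0) == 0xE0)
        return 3; // 1110xxxx: lead byte of a 3-byte character
    if ((c & 0xF8) == 0xF0)
        return 4; // 11110xxx: lead byte of a 4-byte character
    return 0;     // 10xxxxxx continuation byte, or invalid (e.g. 0xFF)
}
```

For the example from the description, 0xE8 classifies as the start of a 3-byte character while 0xB5 and 0x8B are continuation bytes, so a buffer ending in 0xE8 0xB5 is one byte short. The loop above treats 1-byte characters and invalid bytes the same way: it stops scanning and reports the text as complete.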

@cjpais
Collaborator

cjpais commented Mar 24, 2025

I've tested this and it works in my basic testing, so I'm pulling it in. We may want to modify the function slightly later.

Thanks so much for the contribution!

@cjpais cjpais merged commit a9658c7 into Mozilla-Ocho:main Mar 24, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Bug: Chinese coding error in Server V2
2 participants